
feat: support option for continuous monitoring token usage in streaming response #111

Open · wants to merge 1 commit into base: main

Conversation

sukumargaonkar

Currently only the total_tokens usage from the response body is pushed to dynamicMetadata. This PR updates that logic to include input and output token usage as well.

This PR also introduces a monitorContinuousUsageStats flag in the config for the external processor.
The flag controls whether the external processor monitors every response-body chunk for usage stats:
when true, it monitors every response-body chunk received during a streaming request for token usage metadata (compatible with vLLM's continuous_usage_stats flag);
when false, it stops monitoring after detecting token usage metadata for the first time (compatible with OpenAI's streaming responses: https://platform.openai.com/docs/api-reference/chat/streaming#chat/streaming-usage).
The flag only affects requests in streaming mode. A rough sketch of the intended behavior follows below.
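
The following is a minimal sketch of the flag's intended semantics, not the PR's actual extproc code; the streamState and onResponseBodyChunk names and the JSON field mapping are illustrative assumptions only.

// Illustrative sketch only (hypothetical names): how the proposed flag could
// gate per-chunk usage parsing in a streaming response handler.
package sketch

import "encoding/json"

type tokenUsage struct {
	InputTokens  uint32 `json:"prompt_tokens"`
	OutputTokens uint32 `json:"completion_tokens"`
	TotalTokens  uint32 `json:"total_tokens"`
}

type streamState struct {
	monitorContinuousUsageStats bool // the proposed config flag
	usageSeen                   bool
	latest                      tokenUsage
}

// onResponseBodyChunk receives the JSON payload of each streamed response-body chunk.
func (s *streamState) onResponseBodyChunk(chunk []byte) {
	// OpenAI-style streams report usage once, on the final chunk, so stop
	// looking after the first hit unless continuous monitoring is enabled
	// (vLLM's continuous_usage_stats reports usage on every chunk).
	if s.usageSeen && !s.monitorContinuousUsageStats {
		return
	}
	var event struct {
		Usage *tokenUsage `json:"usage"`
	}
	if err := json.Unmarshal(chunk, &event); err != nil || event.Usage == nil {
		return
	}
	s.latest = *event.Usage
	s.usageSeen = true
}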

@sukumargaonkar requested a review from a team as a code owner on January 16, 2025 22:37
Comment on lines +71 to +80
// MonitorContinuousUsageStats controls whether the external processor monitors every response-body chunk for usage stats.
// When true, it monitors every response-body chunk received during a streaming request for token usage metadata
// (compatible with vLLM's 'continuous_usage_stats' flag).
// When false, it stops monitoring after detecting token usage metadata for the first time
// (compatible with OpenAI's streaming responses: https://platform.openai.com/docs/api-reference/chat/streaming#chat/streaming-usage).
// Only affects requests in streaming mode.
MonitorContinuousUsageStats bool `yaml:"monitorContinuousUsageStats,omitempty"`
@mathetake (Member) commented Jan 16, 2025

Could you remove the change related to this? I think this is a separate issue, and the metadata is not cumulative, so it basically overrides the previous value if it's emitted in the middle of the stream.

Contributor

I think this should be a property on the AIServiceBackend, as only certain backends support this, e.g. the vLLM service backend.

.gitignore (review thread outdated, resolved)
@mathetake (Member) commented Jan 16, 2025

Sorry, this feels like a conflict with #103. I would appreciate it if you could hold off until that lands - I think that PR will supersede this one, apart from the vLLM part.

@mathetake (Member)

@sukumargaonkar thank you for waiting - #103 has landed, so could you rework the PR and focus on the vLLM part?

@mathetake self-assigned this Jan 18, 2025
@mathetake (Member)

ping

@mathetake (Member)

@sukumargaonkar do you still want to continue the PR here? I will close this in a few days if there's no response, as there's no reason to keep it open.

@sukumargaonkar (Author)

Yes, I will rebase and include only the vLLM-specific changes.

@mathetake (Member)

great!

@mathetake (Member)

Checking in - how long do you need to rework the PR here? @sukumargaonkar it should be pretty straightforward, right? I wonder if this can get into the initial release.

Currently only the total_tokens usage from the response body is pushed to dynamicMetadata. This PR updates that logic to include input and output token usage as well.

Signed-off-by: Sukumar Gaonkar <[email protected]>

@yuzisun changed the title from "Extract Input/Output token usage from request." to "feat: support option for continuous monitoring token usage in streaming response" on Jan 27, 2025
@mathetake (Member) left a comment

A couple of questions:

  • Do we really want to make this an option? Even if so, adding the option only to filterconfig.go doesn't make sense.
  • How does this apply to AWS?

@yuzisun (Contributor) commented Jan 28, 2025

A couple of questions:

  • Do we really want to make this an option? Even if so, adding the option only to filterconfig.go doesn't make sense.
  • How does this apply to AWS?

I made a comment above: it should be a flag on the AIServiceBackend indicating whether continuous token usage monitoring is supported.
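
Purely as an illustration of that suggestion (the field name and placement below are hypothetical, not part of the actual AIServiceBackend API):

// Hypothetical sketch only: a per-backend flag on the AIServiceBackend spec,
// instead of a global option in filterconfig.go.
type AIServiceBackendSpec struct {
	// ... existing fields elided ...

	// ContinuousUsageStats indicates that this backend (e.g. vLLM with
	// continuous_usage_stats enabled) reports cumulative token usage on every
	// streamed chunk.
	ContinuousUsageStats bool `json:"continuousUsageStats,omitempty"`
}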

@mathetake (Member) commented Jan 28, 2025

@yuzisun I think I should have rephrased the question: why don't we just enable this logic by default (i.e. remove the .bufferingDone flag)? We already parse every event by default (since the used-token event is always the last one), so this option doesn't help us at all in terms of computation. (On that note, I think I shouldn't have introduced bufferingDone in the first place.)
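
For illustration, a minimal sketch of that simplification (reusing the hypothetical streamState and tokenUsage types from the sketch in the PR description; not actual code): drop the flag and the bufferingDone gate and parse every chunk, keeping the last usage value seen.

// Sketch of the "no option" variant: parse every chunk unconditionally; since
// the usage event is the last one in OpenAI-style streams, the last value seen
// ends up in the metadata anyway.
func (s *streamState) onResponseBodyChunk(chunk []byte) {
	var event struct {
		Usage *tokenUsage `json:"usage"`
	}
	if err := json.Unmarshal(chunk, &event); err != nil || event.Usage == nil {
		return
	}
	s.latest = *event.Usage
}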

@sukumargaonkar (Author)

I agree, it's better to parse every message chunk while streaming to check for the presence of usage data.
I also noticed that in the case of vLLM, when usage data is included in each chunk, it includes the usage of all previous chunks as well.

I don't think we need to do the aggregation here:

p.costs.InputTokens += tokenUsage.InputTokens
p.costs.OutputTokens += tokenUsage.OutputTokens
p.costs.TotalTokens += tokenUsage.TotalTokens

Thoughts, @mathetake?

@mathetake (Member)

I also noticed that in the case of vLLM, when usage data is included in each chunk, it includes the usage of all previous chunks as well.

I see, that's a good finding. Is there any documentation about that? Then let's change the aggregation as well as remove the bufferingDone flag.
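
For illustration only, under the assumption reported above that vLLM's per-chunk usage is already cumulative, the change being discussed could look roughly like replacing the accumulation with an overwrite (mirroring the snippet quoted earlier; not the actual diff):

// If each chunk's usage already covers the whole stream so far, summing would
// double-count; the latest value can simply replace the previous one.
p.costs.InputTokens = tokenUsage.InputTokens
p.costs.OutputTokens = tokenUsage.OutputTokens
p.costs.TotalTokens = tokenUsage.TotalTokens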

@mathetake added this to the v.0.1.0 milestone Jan 29, 2025
@mathetake (Member)

ping

@mathetake (Member) commented Jan 31, 2025

@sukumargaonkar hey, are you still interested in this? I would prefer a fast turnaround on a single PR and to avoid keeping the context around for a long time, so I would appreciate it if you could rework the PR soon. Otherwise, I will close this and redo it myself. Thanks!
